[gfx906] Collection of fixes for MI50/MI60 (non-MFMA) GPUs#3593
Closed
dbsanfte wants to merge 4 commits into ROCm:develop
Conversation
Problem: `DeviceGemmDl` crashes on gfx906 when K >= 1472 with small M (M=1 decode case).

Root cause: `CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK` was disabled by default. Without this, invalid buffer loads execute and crash before bounds checking can prevent them.

Solution:
1. Enable the OOB offset trick (`0x80000000`) so invalid coordinates safely return zero instead of accessing unmapped memory
2. Use the full `coordinate_has_valid_offset()` check instead of the `_assuming_visible_index_is_valid` variant for proper K bounds validation

Verified with INT8 GEMM tests: M=1 decode, K=14336, FFN projections.
Problem: When `FloatAcc` differs from `FloatC` (e.g., INT8×INT8→INT32 accumulator with FP32 output scaling), the CDE element op is invoked with the wrong storage types.

The element op contract is `(E& e, const C& c, const D& d...)`, where:
- `E` = `FloatC` (final output type, e.g., `float`)
- `C` = `FloatAcc` (accumulator type, e.g., `int32_t`)

The original code used `generate_tie()`, returning the same `c_thread_buf` for both `E&` and `C&`, which:
1. Violates the element op signature when the types differ
2. Causes compile errors with strictly-typed element ops
3. Results in undefined behavior during `ThreadwiseTensorSliceTransfer`

Solution: Introduce a separate `e_thread_buf<FloatC>` for the element op output, pass `(E& e)` from `e_thread_buf` and `(const C& c)` from `c_thread_buf`, then transfer `e_thread_buf` to global memory.

The bug has existed since the file was created in December 2022 (PR ROCm#517).
Pull request overview
This PR addresses critical bugs in ComposableKernel's DeviceGemmDl implementation for gfx906 (MI50/MI60) GPUs that lack MFMA instructions. The fixes resolve out-of-bounds memory access crashes and type safety issues affecting INT8 GEMM operations with mixed accumulator and output types.
Changes:
- Enable hardware-assisted OOB protection by default for buffer loads on gfx906
- Fix bounds checking logic to validate all coordinates, preventing invalid memory accesses
- Correct type mismatches in element operations when accumulator type differs from output type
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| include/ck/ck.hpp | Enables offset trick for OOB buffer load protection by default |
| include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v5r1.hpp | Strengthens bounds validation to check all coordinates including visible indices |
| include/ck/tensor_operation/gpu/grid/gridwise_gemm_dl_multiple_d.hpp | Introduces separate buffer for element op output with correct type (FloatC) to fix type mismatch when FloatAcc differs from FloatC |
- Add `CK_GFX906_DEBUG` macro for conditional debug output
- Log GEMM parameters (M, N, K, strides) for gfx906 devices
- Track which device GEMM variants are being invoked
- Helps diagnose launch bounds and occupancy issues on older GCN
Force-pushed cf801de to 5046c60
- Comment out always-on `std::cout` debug spam in device_gemm_multiple_d_dl.hpp
- Add optional `CK_DEBUG_KERNEL`-gated logging in gridwise_gemm_dl_v1r3.hpp
- Fixes console spam on every GEMM call for gfx906 devices
Force-pushed 5503add to 58b17c3
Imported to ROCm/rocm-libraries
Summary
This PR is an aggregation of fixes discovered while working with ComposableKernel on gfx906 (MI50/MI60) GPUs. These GPUs don't have MFMA instructions, so they rely on the `DeviceGemmDl` path, which has some edge cases that aren't well-tested.

Note: This is a draft PR that will be updated as we discover more issues.
Fix 1: Buffer Load OOB Crash with Large K and Small M
Problem
`DeviceGemmDl` crashes on gfx906 when K >= 1472 with small M (e.g., the M=1 decode case in LLM inference). The crash occurs in gridwise_gemm_dl_v1r3.hpp during `block_sync_lds()` after an invalid buffer load.

Root Cause
`CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK` was disabled by default (set to 0).

Without the offset trick:
- invalid buffer loads execute and crash before bounds checking can prevent them

With the offset trick enabled:
- `0x80000000` is added to the offset, so invalid coordinates safely return zero instead of accessing unmapped memory

Solution
- include/ck/ck.hpp: Enable `CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK` by default
- include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v5r1.hpp: Use `coordinate_has_valid_offset()` instead of `coordinate_has_valid_offset_assuming_visible_index_is_valid()` for full bounds validation

Verification
INT8 GEMM tests pass for:
- M=1 decode
- K=14336
- FFN projections
Fix 2: GridwiseGemmDlMultipleD Element Op Type Mismatch (FloatAcc != FloatC)
Problem
When `FloatAcc` differs from `FloatC` (e.g., INT8×INT8→INT32 accumulator with FP32 output scaling), the CDE element op is invoked with the wrong storage types.

The element op contract is `(E& e, const C& c, const D& d...)`, where:
- `E` = `FloatC` (final output type, e.g., `float`)
- `C` = `FloatAcc` (accumulator type, e.g., `int32_t`)

Root Cause
The original code at lines 615-618 used `generate_tie()`, returning the same `c_thread_buf` for both `E&` and `C&`. This causes:
- a violated element op signature when `FloatAcc != FloatC` (the element op expects `float&` for `e`, gets `int32_t&`)
- compile errors with strictly-typed element ops
- undefined behavior in `ThreadwiseTensorSliceTransfer`, which type-puns `FloatAcc` bits as `FloatC`

This bug has existed since the file was created in December 2022 (PR #517).
Solution
In include/ck/tensor_operation/gpu/grid/gridwise_gemm_dl_multiple_d.hpp:
- Introduce a separate `e_thread_buf<FloatC>` for element op output
- Pass `(E& e)` from `e_thread_buf` and `(const C& c)` from `c_thread_buf` using `tie()`
- Transfer `e_thread_buf` (not `c_thread_buf`) to global memory

Minimal Repro
See the original PR #3565 for a compile-time repro that demonstrates the type mismatch.
Environment